Materializing the editing history of Wikipedia as linked data in DBpedia
Abstract
We describe a DBpedia extractor that materializes the editing history of Wikipedia pages as linked data to support queries and indicators on that history. The different instances of the DBpedia platform typically extract RDF from Wikipedia using up to 16 extractors. The extraction focuses on structured content, including infoboxes, categories and links. As an example, the French chapter, for which we are responsible, extracted 185 million triples in 2015. The resulting RDF graph is then published; in 2015 it served an average of 70,000 SPARQL queries per day, with peaks of up to 2.5 million queries per day.
But Wikipedia is a social medium that produces more data than the actual content of its pages. The activity of the epistemic communities of Wikipedia produces a huge amount of traces showing, for instance, the evolution, conflicts, trends and variety of opinions of the users. In fact, the different projects of the Wikimedia Foundation develop at a rate of over ten edits per second, performed by users from all over the world. This activity covers a broad collection of topics: the English chapter of Wikipedia alone has over 5 million articles, and the combined Wikipedias for all other languages exceed the English chapter in size, with more than 27 billion words in 40 million articles in 293 languages. As a result, the history of the editing actions captures the peaks and shifts of interest of the contributors and indirectly reflects the unfolding of events all around the world and in every domain.
Providing means to monitor the editing activity has always been important for Wikipedians to follow the changes. These means include APIs such as the recent changes API, per-language IRC streams, WebSockets streams and Server-Sent Events streams [6]. Previous work also suggested monitoring the real-time editing activity of Wikipedia to detect events such as natural disasters [1].
In [3] a resource versioning mechanism inspired by the Memento protocol (RFC 7089) is applied, but only to DBpedia dumps. In [4] historical versions of resources are regenerated for a given timestamp with some revision data, but only through a RESTful API. In [5] the preservation of the history of linked datasets is tested, but only on a sample of 100,000 resources. We do not mention here work on formats, vocabularies or algorithms to detect and describe updates to RDF datasets since, at this stage, we focus on editing acts on Wikipedia.
1 French DBpedia Chapter http://fr.dbpedia.org/
2 Wikipedia Statistics https://en.wikipedia.org/wiki/Wikipedia:Statistics
3 Comparisons accessed 23/08/16 https://en.wikipedia.org/wiki/Wikipedia:Size_comparisons
Data about the activity provide historical indicators of interest and attention over the set of resources they cover. They have also been used, for instance, to assess the currency of the data [7], to study conflict resolution [8], to temporally anchor data, to attribute changes and identify vandalism [9], or to precisely attribute the authorship of content [10]. Conversely, using statements from other datasets (e.g. typing), one can filter and analyze the editing history along chosen dimensions (e.g. focus on events about artists). But none of the previous contributions supports public SPARQL querying of the full editing history. The potential of these linked data is even greater when they are combined with other linked data sources, which is not easily done with an API approach, e.g. “give me the 10 most edited populated places in July 2012”. For this reason, we designed and provide a new DBpedia extractor producing a linked data representation of the editing history of Wikipedia pages. Instead of real-time monitoring, we capture the history as linked data so that we can query it, mine it and combine it with other sources, augmenting the dimensions we can exploit when querying linked data in general and DBpedia in particular.
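As a rough illustration of the kind of request quoted above (“the 10 most edited populated places in July 2012”), the following Python sketch assembles a SPARQL query and the corresponding GET URL. Only the endpoint address and the property names (as they appear in the extractor output shown in Figure 1) come from the paper. The prefix IRIs for dc and dbfr, the use of dc:subject to link a history record to its DBpedia resource, and the availability of dbo: types on the same endpoint are assumptions made for this example.

```python
from urllib.parse import urlencode

# Endpoint address as given in the paper's footnote.
ENDPOINT = "http://dbpedia-historique.inria.fr/sparql"

# ASSUMPTION: the dc and dbfr IRIs are placeholders chosen for this sketch;
# the namespaces actually used by the extractor are not given in the text.
PREFIXES = """\
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX dc:   <http://purl.org/dc/terms/>
PREFIX dbfr: <http://example.org/dbfr#>
"""


def build_query(rdf_type, month, limit=10):
    """Return a query ranking resources of `rdf_type` by edits in `month`.

    `month` mirrors the "MM/YYYY" literal form seen in Figure 1.
    ASSUMPTION: dc:subject links the history record to the DBpedia resource.
    """
    return PREFIXES + f"""SELECT ?page ?edits WHERE {{
  ?history a prov:Revision ;
           dc:subject ?page ;
           dbfr:revPerMonth [ dc:date "{month}"^^xsd:gYearMonth ; rdf:value ?edits ] .
  ?page a {rdf_type} .
}}
ORDER BY DESC(?edits)
LIMIT {limit}
"""


def build_url(query):
    """Encode the query as a SPARQL protocol GET request URL."""
    return ENDPOINT + "?" + urlencode({"query": query})
```

Calling `build_url(build_query("dbo:PopulatedPlace", "07/2012"))` yields a URL that could be fetched with any HTTP client. Whether the query returns results depends on the assumptions listed above holding for the published data.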
A history dump of a Wikipedia chapter contains all the modifications dating back to the inception of this linguistic chapter, along with some information for each and every modification. As an example, the French editing history dump represents 2 TB of uncompressed data. The data extraction is performed through streams in Node.js with a MongoDB instance. It took 4 days to extract 55 GB of RDF in Turtle on 8 Intel(R) Xeon(R) CPU E5-1630 v3 @ 3.70GHz with 68 GB of RAM and using SSD disks. The result is then published through a SPARQL endpoint with the DBpedia chapter. The extractor reuses as many existing vocabularies from the LOV directory as possible in order to facilitate integration and reuse.
Figure 1 is a sample of the output of the editing history extractor for the page describing the author “Victor Hugo” in the DBpedia French chapter. The history data for such an entry contains one section of general information about the article history (lines 1-15), along with as many additional sections as there are previous revisions, to capture each change (e.g. two revisions at lines 16-24). The general information about the article includes: the number of revisions (line 3), the dates of creation and last modification (lines 4-5), the number of unique contributors (line 6), the number of revisions per year and per month (e.g. lines 7-8) and the average sizes of revisions per year and per month (e.g. lines 9-10). In addition, each individual revision description includes: the date and time of the modification (e.g. line 17), the size of the revision as a number of characters (e.g. line 18), the size of the modification as a number of characters (e.g. line 19), the optional comment of the contributor (e.g. line 20), the username or IP address of the contributor and whether the contributor is a human or a bot (e.g. line 21 or 24), and a link to the previous revision (e.g. line 22).
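To make the per-article indicators listed above more concrete, here is a minimal Python sketch that derives them from a list of revision records. It is not the extractor's code (which, according to the text, runs on Node.js streams with MongoDB). The record keys `timestamp`, `size` and `contributor`, and the function name, are hypothetical choices made for this illustration.

```python
from collections import defaultdict
from datetime import datetime


def summarize_history(revisions):
    """Aggregate article-level indicators from revision records.

    Each record is a dict with hypothetical keys: 'timestamp' (ISO 8601
    string), 'size' (revision size in characters) and 'contributor'
    (username or IP address).
    """
    if not revisions:
        raise ValueError("at least one revision is required")

    # Order revisions from oldest to newest.
    ordered = sorted(revisions, key=lambda r: datetime.fromisoformat(r["timestamp"]))

    sizes_by_year = defaultdict(list)
    sizes_by_month = defaultdict(list)
    size_deltas = []
    previous_size = 0
    for rev in ordered:
        stamp = datetime.fromisoformat(rev["timestamp"])
        sizes_by_year[stamp.strftime("%Y")].append(rev["size"])
        sizes_by_month[stamp.strftime("%Y-%m")].append(rev["size"])
        # Size of the modification: difference with the previous revision.
        size_deltas.append(rev["size"] - previous_size)
        previous_size = rev["size"]

    def mean(values):
        return sum(values) / len(values)

    return {
        "revision_count": len(ordered),
        "created": ordered[0]["timestamp"],
        "modified": ordered[-1]["timestamp"],
        "unique_contributors": len({r["contributor"] for r in ordered}),
        "rev_per_year": {k: len(v) for k, v in sizes_by_year.items()},
        "rev_per_month": {k: len(v) for k, v in sizes_by_month.items()},
        "avg_size_per_year": {k: mean(v) for k, v in sizes_by_year.items()},
        "avg_size_per_month": {k: mean(v) for k, v in sizes_by_month.items()},
        "size_deltas": size_deltas,
    }
```

The sketch keeps every revision in memory, which is only reasonable for a single article; a dump of the size reported above would call for a streaming aggregation instead.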
By construction, the data are fully linked to the DBpedia resources, and the vocabularies used include PROV-O, Dublin Core, the Semantic Web Publishing Vocabulary, the DBpedia ontologies, FOAF and SIOC. As a result, the produced linked data are well integrated into the LOD cloud. Whenever a predicate was missing, we added it to the DBpedia FR ontology. As shown in Figure 2, these data support very arbitrary queries such as, in this example, requesting the most modified pages grouped by pairs of pages modified the same day.
4 History endpoint http://dbpedia-historique.inria.fr/sparql
5 Linked Open Vocabularies (LOV) http://lov.okfn.org/ as accessed in June 2016
1. a prov:Revision ;
2. dc:subject ;
3. swp:isVersion "3496"^^xsd:integer ;
4. dc:created "2002-06-06T08:48:32"^^xsd:dateTime ;
5. dc:modified "2015-10-15T14:17:02"^^xsd:dateTime ;
6. dbfr:uniqueContributorNb 1295 ;
(...)
7. dbfr:revPerYear [ dc:date "2015"^^xsd:gYear ; rdf:value "79"^^xsd:integer ] ;
8. dbfr:revPerMonth [ dc:date "06/2002"^^xsd:gYearMonth ; rdf:value "3"^^xsd:integer ] ;
(...)
9. dbfr:averageSizePerYear [ dc:date "2015"^^xsd:gYear ; rdf:value "154110.18"^^xsd:float ] ;
10. dbfr:averageSizePerMonth [ dc:date "06/2002"^^xsd:gYearMonth ; rdf:value "2610.66"^^xsd:float ] ;
(...)
11. dbfr:size "159049"^^xsd:integer ;
12. dc:creator [ foaf:nick "Rinaldum" ] ;
13. sioc:note "wikification"^^xsd:string ;
14. prov:wasRevisionOf ;
15. prov:wasAttributedTo [ foaf:name "Rémih" ; a prov:Person, foaf:Person ] .
16. a prov:Revision ;
17. dc:created "2015-09-29T19:35:34"^^xsd:dateTime ;
18. dbfr:size "159034"^^xsd:integer ;
19. dbfr:sizeNewDifference "-5"^^xsd:integer ;
20. sioc:note "/*Années théâtre*/ neutralisation"^^xsd:string ;
21. prov:wasAttributedTo [ foaf:name "Thouny" ; a prov:Person, foaf:Person ] ;
22. prov:wasRevisionOf .
(...)
23. a prov:Revision ;
24. prov:wasAttributedTo [ foaf:name "OrlodrimBot" ; a prov:SoftwareAgent ] ;
(...)
Fig. 1. Extract of the output of the editing history extractor for Victor Hugo
PREFIX dc:
PREFIX prov:
PREFIX swp:
SELECT DISTINCT ?x ?y ?d WHERE
{ ?x a prov:Revision .
  ?y a prov:Revision .
  ?x dc:modified ?d .
  ?y dc:modified ?d .
  ?x swp:isVersion ?v .
  FILTER (?v>1000 && ?x<?y) } LIMIT 10
Fig. 2. Ten of the most modified pairs of pages modified the same day
The STTL template language [2] makes it possible to generate portals quickly and declaratively. We used it to build two portals that show the richness of the materialized historical data. The first application is a visual history browser that displays images of the 50 most edited topics for every month. With the second portal, we demonstrate the ability to join this new dataset with other linked data sources, starting with DBpedia itself: we built a focused portal generator that restricts the monitoring of activity to specific DBpedia categories of resources (e.g. companies, actors, countries, etc.). Figure 3 is a screenshot of the portal focused on countries and shows the events in Ukraine in 2014. Many applications of the editing activity already exist [6], and these two portals are only a proof of concept of what can be done with SPARQL over the linked data of editing activity.
6 Category-filtered view of the history: focusing on artists (mode=dbo:Artist) http://corese.inria.fr/srv/template?profile=st:dbedit&mode=dbo:Artist or focusing on countries (mode=dbo:Country) using the DBpedia ontology http://corese.inria.fr/srv/template?profile=st:dbedit&mode=dbo:Country
Fig. 3. Portal showing countries whose pages were subject to maximum activity.
The history extractor is now integrated into the DBpedia open-source code and runs on the production server of the French chapter. We are studying the integration of the live change feed for both the chapter and its history in order to reflect real-time changes to the content and editing logs. We are also considering ways to represent more precisely the changes between two revisions.